Data Visualisation

Ben Whalley

Nothing new under the sun: Mediaeval line plot, circa 1010. Image: wikimedia


Facets

As we add layers our plots become more complex. We run into trade-offs between information density and clarity.

To give one example, this plot shows life expectancies for each country in the development data, plotted by year:

development %>%
  ggplot(aes(year, life_expectancy, group=country)) +
  geom_smooth(se=FALSE)

Explanation: This is another x/y plot. However this time we have not added points, but rather smoothed lines (one for each country).

Explanation of the code:We have created an x/y plot as before, but this time we only added geom_smooth (and not geom_point), so we can’t see the individual datapoints. We have also added the text group=country which means we see one line per-country in the dataset. Finally, we also added se=FALSE. This hides the shaded area which geom_smooth adds by default.

Comment on the result: It’s pretty hard to read!


To increase the information density, and explore patterns within the data, we might add another dimension and aesthetic. The next plot colours the lines by continent:

development %>%
  ggplot(aes(year, life_expectancy, colour=continent, group=country)) +
  geom_smooth(se=FALSE)

However, even with colours added it’s still a bit of a mess. We can’t see the differences between continents easily.

To clean things up we can use a technique called facetting.

development %>%
  ggplot(aes(year, life_expectancy, group=country)) +
  geom_smooth(se=FALSE) +
  facet_grid(~continent)

Explanation: We added the code + facet_grid(.~continent) to our previous plot, and removed the color aesthetic. This made ggplot create individual panels for each continent. Splitting the graph this way makes it somewhat easier to compare the differences between continents.

Try facetting yourself

Use the iris dataset, which is built into R.

  1. Try to recreate this plot by adapting the code shown above

  2. Create a second plot which uses colours to distinguish species and does not use Facets

  3. In this example, which plot do you prefer? What influences when facets are more useful the just using colour?

iris %>%
  ggplot(aes(Sepal.Length, Petal.Length)) +
  geom_point()  +
  geom_smooth() +
  facet_grid(~Species)
iris %>%
  ggplot(aes(Sepal.Length, Petal.Length, color=Species)) +
  geom_point()  +
  geom_smooth() 

There’s no single right answer here, but for this example I prefer the coloured plot to the faceted one. The reason is that there are only 3 species in this dataset, and the points for each don’t overlap much. This means it is easy to distinguish them, even in the combined plot.

However, if there were many different species it might be helpful to use facets instead.

Our decisions should be driven by what we are trying to communicate with the plot. What was the research question that motivated us to draw it?

Adjusting the facets

In the explanation above, we saw this plot:

development %>%
  ggplot(aes(year, life_expectancy, group=country)) +
  geom_smooth(se=FALSE) +
  facet_grid(~continent)

Re-run the plot and make the following adaptations:

  1. Try replacing facet_grid(~continent) with facet_grid(continent~.). What happens?

  2. Try replacing facet_grid with facet_wrap(~continent). What happens?

  3. If you have time, to see more faceting examples, see the ggplot cookbook documentation. Try to re-create some of these examples.

Comparing categories

In the examples above we have been plotting continuous variables (and adding colours etc). We’ve used density, scatter and smoothed line plots to do this.

Another common requirement is to use plots to compare summary statistics for different groups or categories. For example, the classic plot in a psychology study looks like this:

However, there is evidence that readers often misinterpret bar plots. Specifically, the problem is that we perceive values within the bar area as more likely than those just above, even though this is not in fact the case.

A better choice is (almost always) to use a boxplot:

expdata  %>%
  ggplot(aes(x=stimuli, y=RT)) + 
  geom_boxplot()

Explanation: We used Condition, a category, as our x axis, and reaction times as the y axis. We added geom_boxplot to show a boxplot.

If you’re not familiar with boxplots, there are more details in the help files (type ?geom_boxplot into the console) or use the wikipedia page here

Loading data over the internet

So far we have always used data from the psydata package. However, we can also load data from files or over the internet using the read_csv() function. This is an example:

expdata <- read_csv('https://gist.githubusercontent.com/benwhalley/f94baf447612e2434b181739dbba27df/raw/43df26022fff68f49918c795f27d7352dc0d3425/expdata.csv')

expdata %>% glimpse
Rows: 240
Columns: 4
$ Condition <chr> "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"…
$ stimuli   <chr> "S1", "S1", "S1", "S2", "S2", "S2", "S3", "S3", "S3", "S4", "S4", "S4", "S1", "S1", "S1", "S2", "S2", "…
$ p         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ RT        <dbl> 151.84776, 313.68298, 556.61563, 53.63154, 560.36673, 586.03610, 504.98535, 271.76747, 267.13422, 408.1…

Explanation: The part which starts expdata <- read_csv('https://... loads a CSV (data) file over the internet, and stores it in the variable called expdata. We then use glimpse to check the data

Load the (simulated) dataset from this url: https://gist.githubusercontent.com/benwhalley/f94baf447612e2434b181739dbba27df/raw/43df26022fff68f49918c795f27d7352dc0d3425/expdata.csv (note that this URL is very long — make sure you copy the whole of it).

  • Recreate the boxplot above
  • Use a facet to recreate the plot you saw above, combining both Condition and Stimuli

`

expdata  %>%
  ggplot(aes(x=stimuli, y=RT)) + 
  geom_boxplot() +
  facet_wrap(~Condition)

`

Other data summaries

If you really need to plot the mean and standard error or standard deviations of different categories, ggplot also has the stat_summary command:

expdata %>%
  ggplot(aes(Condition, RT)) + 
  stat_summary()

Explanation: We used Condition and RT as our x and y axes, as before. This time we added stat_summary() instead of geom_boxplot(). By default this plots the mean and standard error (a measure of variability) in each group, using a point-range plot. This is better than a bar chart because it avoids a known bias in how we read them. You can ignore the warning about No summary function supplied, defaulting to mean_se() for now.

  • Adapt your facetted boxplot from above to show the mean and standard error instead
  • Can you combine both boxplot and summary in a single plot?

Or, combined:

expdata  %>%
  ggplot(aes(x=stimuli, y=RT)) + 
  stat_summary() +
  facet_wrap(~Condition)

Or, combined:

expdata  %>%
  ggplot(aes(x=stimuli, y=RT)) + 
  geom_boxplot() +
  stat_summary(color="blue") +
  facet_wrap(~Condition)

Decisions about grouping

Consider the following two plots. Both present the same data on the incomes of men and women who are either blind or normally sighted:

  1. Which plot do you prefer, and why?

I don’t have a strong overall preference between the plots, but I think plot A is probabably more useful. My reasoning is that in this plot we can see that:

  1. Men are paid more
  2. Blind people are paid less
  3. The gender-pay gap appears to be larger for blind people

I think that third point is slightly harder to read in plot B.

The idea to take away is that grouping the data in the plot in different ways can make it easier or more difficult to identify different patterns.

  1. Re-organise the plot below to make it easier to see which level of education has the highest gender pay gap.
earnings  %>%
  ggplot(aes(x=education, y=income/1000)) + 
  facet_grid(~gender) + 
  stat_summary() +
  labs(y = "Income (£ 1000)", x="Gender")

earnings  %>%
  ggplot(aes(x=gender, y=income/1000)) + 
  facet_grid(~education) + 
  stat_summary() +
  scale_y_log10() +
  labs(y = "Income (£ 1000)", x="Gender")

  1. Try adding scale_y_log10() to your updated plot. How does this change the impression you have of the data? Which plot is the most truthful?

Data-vis practice

Scenario 1: Secret agent

You are an MI6 agent, and have been sent a mystery dataset by one of your spies. She said it contains highly important information which will be of great interest to your superiors.

  1. Use your ggplot wizardry (and techniques we have covered today) to recover this classified information and present it in a single graphic for Q.

You can load the data using a new command, read_csv(). This command accepts either the name of a file, or a URL that the data is stored on the web.

The file URL is: https://gist.githubusercontent.com/benwhalley/2103941f0c2ac0449247f59fe9c976b6/raw/4630367728610f0a9937d71eeac6fb511065ee46/mystery.csv

To load the data, you might write:

mystery <- read_csv('https://gist.githubusercontent.com/benwhalley/2103941f0c2ac0449247f59fe9c976b6/raw/4630367728610f0a9937d71eeac6fb511065ee46/mystery.csv')

mystery %>% glimpse
Rows: 1,846
Columns: 3
$ category <chr> "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D",…
$ x        <dbl> 55.3846, 51.5385, 46.1538, 42.8205, 40.7692, 38.7179, 35.6410, 33.0769, 28.9744, 26.1538, 23.0769, 22.30…
$ y        <dbl> 97.1795, 96.0256, 94.4872, 91.4103, 88.3333, 84.8718, 79.8718, 77.5641, 74.4872, 71.4103, 66.4103, 61.79…

Scenario 2: Admissions tutor

In the 1970s the University of California, Berkley, was concerned about the fairness of their admissions procedures. They collected data from across the university for a number of years, recording the:

  • Number of applicants
  • The department the student applied to
  • The students’ gender
  • Number of students accepted
  • The percentage students of each gender who were accepted in each department

A summary of these data are available at this link: Berkley applications data

Your job is to:

  • Describe the pattern of applications
  • Decide if the university was fair in it’s admissions procedures
  • Prepare a short presentation for the university governors which includes plots

Techniques/commands you might want to use:

  • filter
  • group_by and summarise
  • stat_summary
  • facet_wrap() or facet_grid()
berkley <- read_csv('https://raw.githubusercontent.com/benwhalley/datafluency/master/website/data/berkley.csv')
berkley %>% glimpse
Rows: 72
Columns: 6
$ year             <dbl> 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1976, 1977, 1977, 1977, 1977, …
$ gender           <chr> "Female", "Female", "Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male", "Mal…
$ department       <chr> "A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F", "A", "…
$ n.applicants     <dbl> 133, 34, 785, 462, 511, 444, 1095, 740, 416, 536, 234, 485, 145, 33, 762, 477, 503, 444, 1065, 7…
$ n.admitted       <dbl> 107, 20, 314, 146, 149, 318, 725, 455, 141, 205, 62, 297, 108, 21, 256, 152, 131, 273, 757, 389,…
$ percent.admitted <dbl> 80.45, 58.82, 40.00, 31.60, 29.16, 71.62, 66.21, 61.49, 33.89, 38.25, 26.50, 61.24, 74.48, 63.64…

This plot suggests a paradox: Although more men applied and were admitted, the proportion of male applicants who were successful was actually lower than for women:

Breaking the data down by-department reveals what is happening:

Note that you don’t (yet) know how to produce the plot above. We will cover this in the next sessions, but for now you could make two individual plots, like so:

This is an example of Simpson’s paradox. You can read more about the case here: http://corysimon.github.io/articles/simpsons-paradox/